CMT 309 Data Science Portfolio

 Parts 1 and 2


 Part 1 - Pre-processing and exploratory analysis

Instructions:

Before submitting,

In this part you will be working with the listings.csv data. To help you get your head around it, we first provide some information on the main columns in the data.

Dataframe columns description:

The next two cells load the listings.csv file into a dataframe. Once loaded, start working on the subsequent questions.

 Question 1a

 Question 1b

 Question 1c: Answering questions.

You do not need to write the answer. In each cell, provide the Pandas code that outputs the result. Each answer can be given with 1-2 lines of Python code. Example question and answer:

# What is the total number of rows in the dataframe?
df.shape[0]

Now over to you:

 Question 1d: Exploratory analyses

Produce a barplot of the average nightly price per neighbourhood as instructed in the Coursework proforma:
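The exact requirements are in the Coursework proforma, which is not reproduced here. As a minimal sketch of the general approach, the block below groups a toy dataframe by neighbourhood and plots the mean price. The column name 'price', the toy values, and the assumption that price is already numeric are illustrative guesses, not taken from listings.csv.

```python
import pandas as pd

# Toy stand-in for listings.csv. The 'price' column name and all values
# are assumptions made for illustration only.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "B", "B", "C"],
    "price": [200.0, 100.0, 80.0, 120.0, 90.0],
})


def mean_price_by_neighbourhood(df):
    """Average nightly price per neighbourhood, highest first."""
    return (
        df.groupby("neighbourhood_cleansed")["price"]
        .mean()
        .sort_values(ascending=False)
    )


avg_price = mean_price_by_neighbourhood(toy)

# pandas delegates plotting to matplotlib, which must be installed.
ax = avg_price.plot.bar(figsize=(8, 4))
ax.set_xlabel("Neighbourhood")
ax.set_ylabel("Average nightly price")
ax.set_title("Average nightly price per neighbourhood")
```

Sorting before plotting is a design choice that makes the most and least expensive neighbourhoods easy to spot; the proforma may ask for a different ordering.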

Plot a correlation matrix as instructed in the Coursework proforma:
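One possible way to approach this, sketched on toy data: compute pairwise Pearson correlations over the numeric columns and draw them as a heatmap. The column names 'price', 'accommodates' and 'bedrooms' and the colour map are assumptions; the proforma may specify which columns and which plot style to use.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for listings.csv; column names and values are assumptions.
toy = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 200.0, 90.0],
    "accommodates": [1, 2, 4, 6, 2],
    "bedrooms": [1, 1, 2, 3, 1],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Hotel room"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

# Draw the matrix as a heatmap with a fixed [-1, 1] colour scale.
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```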

Plot a geographical distribution as instructed in the Coursework proforma:


 Part 2: Statistical analysis and recommender system

 Question 2a: Linear regression and t-tests

T-test questions:

Which room types are significantly different in terms of nightly price?

YOUR ANSWER (1-2 sentences): At alpha = 0.01, nightly price differs significantly for the pairs 'Entire home/apt' and 'Private room', 'Entire home/apt' and 'Hotel room', 'Entire home/apt' and 'Shared room', and 'Private room' and 'Hotel room'.
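The pairwise comparisons behind an answer like this can be sketched as follows. This is an illustration on synthetic prices, not the analysis that produced the answer above: the column names 'room_type' and 'price', the group means, and the use of Welch's t-test (`equal_var=False`) are all assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic prices per room type; the means, spread and sample size are
# made up for illustration and do not come from listings.csv.
rng = np.random.default_rng(0)
assumed_means = {
    "Entire home/apt": 200.0,
    "Private room": 80.0,
    "Hotel room": 190.0,
    "Shared room": 60.0,
}
toy = pd.DataFrame(
    [
        {"room_type": room_type, "price": price}
        for room_type, mean in assumed_means.items()
        for price in rng.normal(loc=mean, scale=20.0, size=40)
    ]
)


def pairwise_ttests(df, group_col="room_type", value_col="price"):
    """Two-sided Welch t-test p-value for every unordered pair of groups."""
    groups = {name: g[value_col].dropna() for name, g in df.groupby(group_col)}
    p_values = {}
    for a, b in combinations(sorted(groups), 2):
        # equal_var=False selects Welch's test (an assumption of this sketch).
        _, p_value = stats.ttest_ind(groups[a], groups[b], equal_var=False)
        p_values[(a, b)] = float(p_value)
    return p_values


p_values = pairwise_ttests(toy)
alpha = 0.01
for pair, p in p_values.items():
    print(pair, "significant" if p < alpha else "not significant")
```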

Do the significances change if you apply a Bonferroni correction to the alpha level (https://en.wikipedia.org/wiki/Bonferroni_correction)?

YOUR ANSWER (1-2 sentences): After Bonferroni correction, with alpha = 0.01/12, only the pairs 'Entire home/apt' and 'Private room', and 'Private room' and 'Hotel room', remain significantly different in nightly price. We divide alpha by 12 because this is the number of room-type combinations considered.
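The Bonferroni arithmetic can be checked with a short sketch. With the four room types named in the answers there are 6 unordered pairs and 12 ordered pairs, so a divisor of 12 corresponds to counting each pair in both directions; which divisor is appropriate depends on how many tests were actually run. The p-values below are hypothetical placeholders, not results from the data.

```python
from itertools import combinations, permutations

room_types = ["Entire home/apt", "Private room", "Hotel room", "Shared room"]

unordered_pairs = list(combinations(room_types, 2))  # each pair counted once
ordered_pairs = list(permutations(room_types, 2))    # each pair counted twice


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha, n_tests):
    """Map each comparison to True if its p-value beats the corrected level."""
    threshold = bonferroni_alpha(alpha, n_tests)
    return {pair: p < threshold for pair, p in p_values.items()}


# Hypothetical p-values chosen only to illustrate the thresholding.
example_p_values = {
    ("Entire home/apt", "Private room"): 1e-6,
    ("Entire home/apt", "Shared room"): 0.005,
    ("Hotel room", "Shared room"): 0.2,
}

decisions = significant_after_bonferroni(example_p_values, alpha=0.01, n_tests=12)
print(len(unordered_pairs), len(ordered_pairs))
print(bonferroni_alpha(0.01, 12))
print(decisions)
```

In this toy example, the comparison with p = 0.005 is significant at the uncorrected alpha = 0.01 but not at the corrected level, which is the kind of change the question asks about.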

 Question 2b: Linear regression with variable selection

Provide a short justification (2-3 sentences) for your choice of variables.

YOUR ANSWER: We exclude the columns 'id', 'name', 'host_id', 'host_name' and 'host_since', as their information would only add bias to our model. We also drop columns with more than 500 NaNs, as we want to drop the remaining NaNs, and the categorical columns with many distinct values, 'neighbourhood_cleansed' and 'property_type', to avoid greatly increasing dimensionality. For the remaining columns we compute the variance inflation factor (VIF) and remove those with high multicollinearity.
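The VIF step can be sketched as below. For each feature j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. This is a minimal NumPy sketch on synthetic data: the feature names are hypothetical, and the threshold of 10 is a common rule of thumb rather than a value stated in the answer above.

```python
import numpy as np
import pandas as pd


def compute_vif(features):
    """Variance inflation factor of each column of a numeric dataframe."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[name] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


def drop_high_vif(features, threshold=10.0):
    """Repeatedly drop the column with the largest VIF above the threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vif = compute_vif(kept)
        worst = vif.idxmax()
        if vif[worst] <= threshold:
            break
        kept = kept.drop(columns=worst)
    return kept


# Synthetic features: feature_b is almost a multiple of feature_a,
# feature_c is independent. Names and values are hypothetical.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "feature_a": a,
    "feature_b": 2.0 * a + rng.normal(scale=0.05, size=200),
    "feature_c": rng.normal(size=200),
})

print(compute_vif(toy))
kept = drop_high_vif(toy)
print(list(kept.columns))
```

Dropping one column at a time and recomputing is a design choice: removing a single collinear column often brings the VIF of its partner back down, so dropping every column above the threshold in one pass could discard more than necessary.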

Question 2c: Recommendation systems

Recommend a neighbourhood given a budget
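The task specification is not given here, so the following is only one hypothetical interpretation: recommend the neighbourhood with the most listings priced at or below the budget. The function name `recommend_neighbourhood`, the 'price' column and the selection rule are all assumptions of this sketch.

```python
import pandas as pd

# Toy stand-in for listings.csv; values are made up for illustration.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B", "C"],
    "price": [60.0, 70.0, 200.0, 65.0, 300.0, 500.0],
})


def recommend_neighbourhood(df, budget):
    """Neighbourhood with the most listings at or below the nightly budget.

    Returns None when no listing is affordable. The selection rule is an
    assumption, not the rule required by the coursework.
    """
    affordable = df[df["price"] <= budget]
    if affordable.empty:
        return None
    counts = affordable["neighbourhood_cleansed"].value_counts()
    return counts.idxmax()


print(recommend_neighbourhood(toy, budget=100))
print(recommend_neighbourhood(toy, budget=50))
```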

 Price recommender for hosts
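As a rough sketch of what a price recommender might look like, the block below suggests the median price of comparable listings, where "comparable" is assumed to mean the same neighbourhood and room type. The function name `recommend_price`, the column names 'room_type' and 'price', and the choice of the median are hypothetical; the coursework may require a different method.

```python
import pandas as pd

# Toy stand-in for listings.csv; values are made up for illustration.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B"],
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Private room"],
    "price": [60.0, 80.0, 100.0, 200.0, 55.0],
})


def recommend_price(df, neighbourhood, room_type):
    """Median nightly price of listings sharing neighbourhood and room type.

    Returns None when there are no comparable listings.
    """
    comparable = df[
        (df["neighbourhood_cleansed"] == neighbourhood)
        & (df["room_type"] == room_type)
    ]
    if comparable.empty:
        return None
    return float(comparable["price"].median())


print(recommend_price(toy, "A", "Private room"))
```

The median is used instead of the mean in this sketch because it is less sensitive to a few very expensive listings.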